Example Notebook (Risk Factors for Cervical Cancer)

Load a dataset

This notebook uses a dataset named 'Risk Factors for Cervical Cancer'. The dataset was collected at 'Hospital Universitario de Caracas' in Caracas, Venezuela, and comprises demographic information, habits, and historic medical records of 858 patients. Several patients decided not to answer some of the questions because of privacy concerns, which results in missing values.
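Loading the data might look like the sketch below. The real notebook reads the full 858-row file; here a tiny inline excerpt (an illustrative assumption, not real patient data) stands in for it, and unknown answers appear as '?' as in the UCI distribution of the dataset.

```python
import io

import pandas as pd

# Tiny inline excerpt standing in for the real CSV file; the column
# names follow the dataset, the rows are made up for illustration.
csv_excerpt = io.StringIO(
    "Age,Number of sexual partners,Smokes,Biopsy\n"
    "18,4.0,0.0,0\n"
    "15,1.0,0.0,0\n"
    "34,1.0,?,0\n"
)
df = pd.read_csv(csv_excerpt)
print(df.shape)  # rows x columns of the loaded excerpt
```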

Preprocess the dataset

The dataset will be used in the same way as described here: https://christophm.github.io/interpretable-ml-book/cervical.html. All unknown values ('?') are set to 0.0.
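A minimal sketch of that preprocessing step, again on a made-up two-row excerpt: replace every '?' with 0.0 and cast the columns to float.

```python
import io

import pandas as pd

csv_excerpt = io.StringIO(
    "Age,Smokes,Biopsy\n"
    "18,0.0,0\n"
    "34,?,0\n"
)
df = pd.read_csv(csv_excerpt)
# Replace unknown values ('?') with 0.0 and make all columns numeric,
# following the preprocessing described above.
df = df.replace("?", 0.0).astype(float)
print(df.dtypes)
```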

Visualize the dataset

Three visualization functions offered by the XAI module will be used for analyzing the dataset.
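The exact XAI-module calls are not reproduced here; as a library-free stand-in, the sketch below computes two quantities such dataset-level visualizations typically display: the class balance of the target and the per-column share of unknown ('?') answers. The three-row frame is an illustrative assumption.

```python
import pandas as pd

# Hypothetical mini-frame standing in for the full dataset.
df = pd.DataFrame({
    "Age": [18, 34, 52],
    "Smokes": ["0.0", "?", "1.0"],
    "Biopsy": [0, 0, 1],
})
# Share of each target class, and share of '?' answers per column --
# the kind of information a class-balance or missingness plot shows.
class_balance = df["Biopsy"].value_counts(normalize=True)
missing_share = df.eq("?").mean()
print(class_balance.to_dict())
print(missing_share.to_dict())
```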

Target

In the cell below the target variable is selected. A biopsy serves as the gold standard for diagnosing cervical cancer, so we use it as the target.
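Selecting the target can be sketched as splitting the frame into a label vector `y` (the Biopsy column) and a feature matrix `X` (everything else); the three-row frame is again an illustrative stand-in.

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [18.0, 34.0, 52.0],
    "Smokes": [0.0, 0.0, 1.0],
    "Biopsy": [0.0, 0.0, 1.0],
})
# Biopsy is the gold standard for diagnosis, so it becomes the target;
# all remaining columns form the feature matrix.
y = df["Biopsy"]
X = df.drop(columns=["Biopsy"])
print(list(X.columns), y.tolist())
```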

Training the models

Four models are trained on this dataset. The output below shows the accuracy, classification report, confusion matrix, and ROC curve for each model.
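A training loop over four classifiers might look like this sketch. The notebook does not name its four models, so the choices below (logistic regression, decision tree, random forest, k-NN) are assumptions, and synthetic data stands in for the cervical-cancer features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed dataset.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Four assumed model choices, not necessarily the notebook's models.
models = {
    "logreg": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(random_state=0),
    "forest": RandomForestClassifier(random_state=0),
    "knn": KNeighborsClassifier(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(name, accuracy_score(y_test, pred))
    print(confusion_matrix(y_test, pred))
    # classification_report and roc_curve would supply the remaining
    # diagnostics shown in the notebook's output.
```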

Global model interpretations

In the following steps we will use global interpretation techniques that help answer questions such as: How does the model behave in general? Which features drive its predictions, and which are effectively useless? This information is very valuable for understanding the model better. Most of these techniques work by investigating the conditional interactions between the target variable and the features over the complete dataset.

Feature importance

The importance of a feature is the increase in the model's prediction error after permuting the feature's values, which breaks the relationship between the feature and the true outcome. A feature is "important" if permuting its values increases the model error, because in that case the model relied heavily on the feature for making correct predictions. Conversely, a feature is "unimportant" if permuting its values leaves the model error largely or entirely unchanged.
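This permutation procedure is implemented in scikit-learn as `permutation_importance`; a minimal sketch on synthetic data (an assumption standing in for the real features):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic data: only 2 of the 5 features are informative.
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=2, random_state=0
)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Permute each feature in turn and measure the drop in score;
# a large drop marks the feature as important.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)
```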

ELI5

In the first case, we use ELI5, which does not permute the features but only visualizes the weight of each feature.
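ELI5 renders these weights with its `show_weights` helper. As a library-free sketch of what that table contains for a linear model, the example below fits a logistic regression on synthetic data (an assumption) and lists its coefficients by absolute magnitude.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Per-feature weights sorted by absolute magnitude -- roughly the
# table ELI5 would render for this linear model.
weights = model.coef_[0]
order = np.argsort(-np.abs(weights))
for i in order:
    print(f"feature_{i}: {weights[i]:+.3f}")
```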